Abstract



We present the results of a feasibility study using shared, existing network-accessible infrastructure for
repository replication. Our goal is not to "hijack" other sites' storage, but to take advantage of protocols
which have persisted through many generations and which are likely to be supported well into the future.
We utilize the SMTP and NNTP protocols to replicate both the metadata and the content of a digital
library, using OAI-PMH and the related Apache web server module, mod_oai, to facilitate management
of the replication process. We investigate how dissemination of repository contents can be piggybacked
on top of existing email and Usenet traffic. Long-term persistence of the replicated repository may be
achieved thanks to current policies and procedures which ensure that email messages and news posts
are retrievable for evidentiary and other legal purposes for many years after the creation date. While
the preservation issues of migration and emulation are not addressed with this approach, it does provide
a simple method of refreshing content with various partners for smaller digital repositories that do not
have the administrative resources for more sophisticated solutions.

1 Introduction

We propose and evaluate two repository replication models that rely on shared, existing network-accessible
infrastructure. Our goal is not to "hijack" other sites' storage, but to take advantage of protocols which
have persisted through many generations and which are likely to be supported well into the future. The
premise is that if archiving can be accomplished within a widely-used, already deployed infrastructure whose
operational burden is shared among many partners, the resulting system will have only an incremental
cost and be tolerant of dynamic participation. With this in mind, we examine the feasibility of repository
replication using Usenet news (NNTP, [1]) and email (SMTP, [2]).

There are reasons to believe that both email and Usenet could function as persistent, if diffuse, archives.

NNTP provides well-understood methods for content distribution and duplicate deletion (deduping) while 
supporting a distributed and dynamic membership. The long-term persistence of news messages is evident 
in "Google Groups," a Usenet archive with posts dating from May 1981 to the present [3]. Even though 
blogs and forums have supplanted Usenet in recent years, many communities still actively use moderated 
news groups for discussion and awareness. Although email is not usually publicly archivable, it is ubiquitous 
and frequent. For example, our departmental SMTP email server alone averaged over 16,000 daily outbound 
emails to more than 4000 unique recipient servers during a 30-day test period. Given enough time, attaching 
repository contents to outbound emails may prove to be an effective way to disseminate contents to previously 
unknown locations. Open source products for news ("INN") and email ("sendmail" and "Postfix") are 
widely installed, so including a preservation function would not impose a significant additional administrative 
burden. 

These approaches do not address the more complex aspects of preservation such as format migration and 
emulation, but they do provide alternative methods for refreshing the repository contents to a variety of 
recipients, known and unknown. There may be quicker and more direct methods of synchronization for some
repositories, but the proposed methods have the advantage of working with firewall-inhibited organizations
and repositories without public, machine-readable interfaces. For example, many organizations have web
servers which are accessible only through a VPN, yet email and news messages can freely travel between these 
servers and other sites without compromising the VPN. Piggybacking on mature software implementations 
of these other, widely deployed Internet protocols may prove to be an easy and potentially more sustainable 
approach to preservation. 

2 Related Work 

Digital preservation solutions often require sophisticated system administrator participation, dedicated 
archiving personnel, significant funding outlays, or some combination of these. Some approaches, for ex- 
ample Intermemory [4], Freenet [5], and Free Haven [6], require personal sacrifice for public good in the 
form of donated storage space. However, there is little incentive for users to incur such near-term costs 
for the long-term benefit of a larger, anonymous group. In contrast, LOCKSS [7] provides a collection of 
cooperative, deliberately slow-moving caches operated by participating libraries and publishers to provide 
an electronic "inter-library loan" for any participant that loses files. Because it is designed to service the
publisher-library relationship, it assumes a level of at least initial out-of-band coordination between the
parties involved. Its main technical disadvantage is that the protocol is not resilient to changing storage 
infrastructures. The rsync program [8] has been used to coordinate the contents of digital library mirrors 
such as the arXiv eprint server but it is based on file system semantics and cannot easily be abstracted to 
other storage systems. Peer-to-peer services have been studied as a basis for the creation of an archiving
cooperative among digital repositories [9]. The concept is promising, but the simulations in [9] indicated that scalability
is problematic for this model. The Usenet implementation [10] of the Eternity Service [11] is the closest to 
the methods we propose. However, the Eternity Service focuses on non-censorable anonymous publishing, 
not preservation per se. 

3 The Prototype Environment 

We began by creating and instrumenting a prototype system using popular, open source products: Fedora 
Core (Red Hat Linux) operating system; an NNTP news server (INN version 2.3.5); two SMTP email
servers, Postfix (version 2.1.5) and sendmail (version 8.13.1); and an Apache web server (version 2.0.49) with
the mod_oai module installed [12]. Figure 1 illustrates the prototype environment we installed. No server
was dedicated to news or mail; they also provided services to other users, including project development
environments, operational software, and web services. mod_oai is an Apache module that provides Open
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [13] access to a web server. Unlike most OAI-PMH
implementations, mod_oai does not just provide metadata about resources, it can encode the entire web 
resource itself in MPEG-21 Digital Item Declaration Language [14] and export it through OAI-PMH. 

There are many kinds of digital libraries and a wide variety of repository file formats in use today. Web 
access to library content is becoming more common, keeping pace with Internet growth and facilitated by the 
many tools which convert hitherto proprietary content to HTML, PDF, or other web-compatible formats. 
In keeping with this trend toward Internet accessibility, we created a small repository of web resources 
consisting of 72 files in HTML, PDF and various image (GIF, JPEG, and PNG) formats. We used our own 
synthetic web site creation tool, building small HTML pages containing a table, some random text, and a 
few images as well as links to other pages in the web site. The PDF files were simple text pages. The files 
were organized into a few subdirectories with file sizes ranging from less than 1 kilobyte up to 1.5 megabytes, 
and the total web site size was approximately 30 MB. 

For the NNTP part of the experiment, we configured the INN news server with common default parameters:
messages could be text or binary; maximum message life was 14 days; and direct news posting was
allowed. For email, we did not impose restrictions on the size of outgoing attachments and messages. We
created the archive messages within the Postfix environment and sent and received the messages using sendmail.
Using custom NNTP and SMTP tools, written mainly in Perl and operated from remote clients,
we harvested the entire repository over 100 times with each tool.

Figure 1: The prototype environment

Table 1: Example of human-readable X-headers added to archival messages

X-Harvest_Time: 2006-2-15T18:34:51Z
X-baseURL: http://beatitude.cs.odu.edu:8080/modoai/
X-OAI-PMH_verb: GetRecord
X-OAI-PMH_metadataPrefix: oai_didl
X-OAI-PMH_Identifier: http://beatitude.cs.odu.edu:8080/1000/pg1000-1.pdf
X-sourceURL: http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord
    &identifier=http://beatitude.cs.odu.edu:8080/1000/pg1000-1.pdf
    &metadataPrefix=oai_didl
X-HTTP-Header: HTTP/1.1 200 OK

We took advantage of OAI-PMH and the flexibility of email and news to embed the URL of each record 
as an X-Header within each message. X-Headers are searchable and human-readable, so their contents give
a clue to the reader about the purpose and origin of the message. Since we encoded the resource itself in 
base64, this small detail can be helpful in a forensic context. If the URL still exists, then the X-Headers 
could be used to re-discover the original resource. Table 1 is a set of actual X-Headers added to an archival 
message, to facilitate discovery and recovery of the replicated record. Both the NNTP and SMTP repository 
harvesting methods use the following algorithm: 



for r = 1 to R
    read repository record r
    format r (mail or news)
    r = base64(r)
    r = r + X-headers
    transmit r
end for
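The loop body above can be sketched in Python for the email case. This is only an illustration, not the authors' Perl tooling: the function name `build_archive_message` and its parameters are hypothetical, the header names follow Table 1, and the transmit step is left out. Python's standard `email` package applies the base64 encoding when given binary content.

```python
from email.message import EmailMessage


def build_archive_message(record_bytes, identifier, base_url, harvest_time,
                          sender, recipient):
    """Wrap one repository record as a mail message with descriptive X-headers.

    Hypothetical helper: names and argument order are assumptions made for
    this sketch, not part of the prototype described in the paper.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Repository record " + identifier

    # r = r + X-headers: human-readable headers modelled on Table 1.
    msg["X-Harvest_Time"] = harvest_time
    msg["X-baseURL"] = base_url
    msg["X-OAI-PMH_verb"] = "GetRecord"
    msg["X-OAI-PMH_metadataPrefix"] = "oai_didl"
    msg["X-OAI-PMH_Identifier"] = identifier

    # r = base64(r): binary content is stored with base64 transfer encoding.
    msg.set_content(record_bytes, maintype="application",
                    subtype="octet-stream", cte="base64")
    return msg
```

The resulting object could then be handed to a mail or news client for the transmit step; which client is used does not affect the message layout.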



Figure 1 graphically illustrates the process for each replication method. In sections 3.1 and 3.2 we discuss 
the specific details of, and differences between, using news and email for repository replication. 

3.1 The News Prototype 

A testament to Internet diversity, Usenet groups exist in many formats. For our experiment, we created
a moderated newsgroup, which means that postings must be authorized by the newsgroup owner. This is
one way newsgroups keep spam from proliferating on the news servers. We also restricted posts to selected 
IP addresses and users, further reducing the "spam window" and ensuring our live experiment would not 
be compromised by external news agents and Usenet enthusiasts. Since the news server was running on a 
live system used by many people not participating in the project, controlling access was important. For the 
experiment, we named our newsgroup "repository.odu.test1," but groups can have any naming scheme that
makes sense to the members. For example, a DNS-based scheme that used "repository.edu.cornell.cs" or
"repository.uk.ac.soton.psy" would be a reasonable naming convention.

Using the algorithm outlined above, we created a news message for each record in the repository (cf.
Appendix). We also collected statistics on (a) original record size vs. posted news message size; (b) time to
harvest, convert and post a message; and (c) the impact of line length limits in news posts. Our experiment
showed high reliability for replicating using NNTP: 100% of the records arrived intact on the target news
server, "beatitude." In addition, 100% of the records were almost instantaneously mirrored on a subscribing
news server ("beaufort"). A network outage during one of the experiments temporarily prevented communication
between the two news servers, but the records were replicated as soon as connectivity was restored.
Retrieving messages was as simple as pointing a news reader to the news server and subscribing to the
"repository.odu.test1" news group.

Figure 3: Email distribution by domain follows a power law



3.2 The Email Prototype 

The mechanics of taking an email message from the email queue, attaching the archive content, and reinsert- 
ing it into the queue are depicted in Figure 2(a). The corresponding extraction of the archive attachment 
can be seen in Figure 2(b). We ran live tests, using Postfix mail servers and a test archive to gather our 
data. Note the OAI-PMH style X-headers that are a part of the email message; these are similar to the 
X-headers of the news-method messages. The few differences are due to the specific header limitations and 
requirements of each protocol. 
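The extraction side can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function name `recover_record` is hypothetical, and the message is assumed to carry a single base64-encoded body rather than a multipart attachment.

```python
from email import message_from_string


def recover_record(raw_message):
    """Return (identifier, record bytes) from one archival message.

    Assumes a single-part message whose body is base64 encoded and whose
    X-OAI-PMH_Identifier header names the original resource.
    """
    msg = message_from_string(raw_message)
    identifier = msg["X-OAI-PMH_Identifier"]
    # decode=True reverses the Content-Transfer-Encoding (base64 here).
    record_bytes = msg.get_payload(decode=True)
    return identifier, record_bytes
```

Because the identifier travels in a plain header, a reader can tell what the message contains without decoding the body first.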

Archiving records by piggybacking on normal email traffic requires sufficient volume to support the effort. 
Analysis of outbound email traffic from our department during a 30-day period showed 505,987 outgoing 
messages to 4,493 unique hosts, with a daily mean frequency of 16,866 emails and a standard deviation of
5,147. In Figure 3, the total number of emails sent to each domain is shown, along with a curve fit. A typical 
power law relationship was evident between the domain's rank and email volume sent to that domain. 

$$V_k = c\,k^{-b} \tag{1}$$

Using the curve fit shown in Figure 3, b = 1.6. Please see the Appendix for the list of the top 50 domains
and the volume of email sent to each. For further discussion, it is necessary to calculate the number of
emails actually sent to a given domain per day. The Euler zeta function

$$\zeta(b) = \sum_{n=1}^{\infty} \frac{1}{n^{b}} \tag{2}$$

can be used to derive the constant c from the overall email traffic volume V:

$$c = \frac{V}{\zeta(b)} \tag{3}$$

There are a number of processing parameters which can be tuned while running the prototype. One
factor is what we call "granularity" (G in Table 2). In our prototype this factor is by definition always
nonzero. The "normal" case is G = 1, which means every single email is selected to receive a
harvested object as an attachment. G can be less than one, in which case only every (1/G)-th email
receives a replication object. If, for example, G = 0.5, only every other email is selected. On the other hand,
G can be greater than one, e.g., G = 3, in which case three objects are attached to every single email.
Granularity G consequently functions as either a damping or an accelerating factor on the pace of
repository replication. The effect of G < 1 on the average time to deliver one email is shown in Figure 4.
With a lower G (fewer emails selected for an attachment), the average delivery time decreases.
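A small Python sketch may make the two regimes of G concrete. It is an illustration only: the function name `attachments_for_email`, the 1-based email counter, and the rounding of 1/G to a whole interval are assumptions, not details taken from the prototype.

```python
def attachments_for_email(email_index, granularity):
    """Number of repository objects to attach to the email_index-th email.

    email_index is 1-based. For G >= 1, int(G) objects go on every email;
    for 0 < G < 1, one object goes on every round(1/G)-th email.
    """
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    if granularity >= 1:
        return int(granularity)
    interval = round(1 / granularity)
    return 1 if email_index % interval == 0 else 0
```

With G = 0.5 the function selects emails 2, 4, 6 and so on; with G = 3 every email carries three objects.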

The prototype can also maintain a history list (pointer) for each destination site. Once this
feature is enabled, it guarantees that a destination domain does not receive duplicate records. The concept
of a history pointer is further explained in section 5.2. 



3.3 Prototype Results 

Having created tools for harvesting the records from our sample digital library, and having used them to 
replicate the repository, we were able to measure the results. How fast is each prototype and what penalties 
are incurred? 
Figure 5: Penalty on average delivery time of one email

Using our NNTP prototype replication tool, we tested posting messages in a variety of sizes. The live 
experiment was run more than 20 times over the course of 6 months. The total time (Tnews) to harvest a
record, encode it in base64, transmit it, and post it to the news server ranged from 0.5 seconds (12 KB) to
26.4 seconds (4.9 MB). Of course, the total time to complete a baseline harvest of the repository varied with
the bandwidth available during each experiment, ranging from 22.7 minutes to 30.9 minutes with a mean 
time of 23.8 minutes, standard deviation of 2.6 minutes, and median time of 22.9 minutes. 

In our email experiment, we measured approximately a 1 second delay in processing email attachments of 
sizes up to 5 MB (see Figure 5). Since the repository consisted of only 72 files and each file was smaller than 5 MB,
Temail, the time to complete a baseline harvest using email, is short: only 72 emails need to be generated
locally, which is a small fraction of the normal email traffic generated by the department.

Besides the trivial linear relationship between repository size and replication time, we found that even 
very detailed X-Headers do not add a significant burden to the process. Not only are they small relative 
to record size (a few bytes vs. kilobytes or more), but they are also quickly generated (less than 0.001 
seconds per record) and incorporated into the archival message. Both NNTP and SMTP protocols are 
robust, with most products (like INN or sendmail) automatically handling occasional network outages or 
temporary unavailability of the destination host. Our experimental results formed the basis of a series of 
simulations using email and Usenet to replicate a repository. 

4 Simulating The Archiving Process 

When transitioning from live, instrumented systems to simulations, there are a number of variables that 
must be taken into consideration in order to arrive at realistic figures (Table 2). Repositories vary greatly 
in size, rate of updates and additions, and number of records. Regardless of the archiving method, a 
repository will have specific policies ("Sender Policies") covering the number of copies replicated; how often 
each copy is refreshed; whether intermediate updates are sent between full backups; and other institutional- 
specific requirements such as geographic location of archives and "sleep time" (delay) between the end of one 
completed archive task and the start of another. The receiving agent will have its own "Receiver Policies" 
such as limits on individual message size, length of time messages live on the server, and whether messages 
are processed by batch or individually at the time of arrival. 

A key difference between news-based and email-based replication is the active-vs-passive nature of the
two approaches. This difference is reflected in the policies and how they impact the archiving process under
each method. A "baseline" refers to making a complete snapshot of a repository. A "cyclic baseline" is
the process of repeating the snapshot over and over again (S = 0), which may result in the receiver storing
more than one copy of the repository. Of course, most repositories are not static. Repeating baselines
will capture new additions (Ra) and updates (Ru) with each new baseline. The process could also "sleep"
between baselines (S > 0), sending only changed content during the interim, or none at all. In short, the
changing nature of the repository can be accounted for when defining its replication policies.

Table 2: Simulation variables

Figure 6: Impact of sender & receiver policies on NNTP replication

4.1 Archiving Using NNTP 

The time to complete a baseline using news is obviously constrained not only by the repository's modification rate (Ra
and Ru), but also by the size of the repository and the speed of the network. Consider Figure 6(a), which
illustrates the generalized replication timeline for three different sender policies. Baseline replication is only
successful when the news server message life (Nttl) is larger than Tnews. Figure 6(b) shows how different
message life limits can impact the feasibility of archiving the repository on a news server under different
sender policies. The red line (marked "X") shows a message life that is not long enough for a single baseline
to complete, i.e., Tnews is too large for the target news server. Line "Y" (the green line) represents a
longer message life than line X, but there is still not enough time for the server to "sleep" between baseline
archives. If the harvest restarts immediately on completion of the first baseline, a full copy can be maintained
on the news server despite its message deletion rate. Repository growth could quickly outpace this balance.
Finally, line "Z" (the blue line) is long enough to allow two complete baselines (copies) to be sent, with a
short sleep period between the baselines. A successful NNTP-based replication strategy will balance Nttl,
Tnews, and the repository's modification rate (Ra + Ru).

Working with the variables from Table 2, we can develop a general formula to estimate the total number 
of records harvested from the repository and posted as news articles during D days. These equations capture 
only discrete values and not transmissions in progress: 

(MaxK \ 
< D (4) 

H^.= (l + %^)w^(.-i) (5) 

W^ = -^ + S (6) 

For the sleep cycle, S, the value varies by sender policy: 

$$S = 0 \Rightarrow \text{continuous baseline} \tag{7}$$

$$S = D \Rightarrow \text{cyclic baseline every } D \text{ days with updates} \tag{8}$$

$$S = \infty \Rightarrow \text{single baseline only} \tag{9}$$

The total number of records currently replicated at a particular news server N on a given day D takes into
account the lifetime (Nttl) of news messages on that server:

$$TR_{news}\ \text{at server}\ N = TR_{news}(D) - TR_{news}(D - N_{ttl}) \tag{10}$$
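The sliding-window effect of Equation 10 can be illustrated with a short Python sketch. It rests on a simplifying assumption that is not part of the model above: records are posted at a constant daily rate, so the cumulative total is simply rate times days. The function names are hypothetical.

```python
def records_posted(day, records_per_day):
    """Cumulative records posted by the given day (assumed constant rate)."""
    return records_per_day * max(day, 0)


def records_on_server(day, records_per_day, ttl_days):
    """Records still held on the server: posted so far minus those expired."""
    posted = records_posted(day, records_per_day)
    expired = records_posted(day - ttl_days, records_per_day)
    return posted - expired
```

Under this assumption the count grows until the message lifetime is reached and then stays flat, because each new day's posts are offset by the posts that expire.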

Nearly all repositories will have daily updates and new additions that need to be accounted for when 
determining Tnews. Even "static" repositories which do not accept new entries are likely to have a certain
amount of periodic record modification as errors, for example, are found and corrected. A larger time gap 
between baseline harvest completion and news message expiration will give the harvesting repository more 
"room to grow" before the two timelines collide. 

NNTP is an older protocol, with limits on line length and content which impact building the news 
messages. Converting binary content to base64 overcomes such restrictions but at the cost of increased file 
size (one-third) and replication time. Even though storage costs continue to decline, a complete baseline 
harvest with its associated metadata and base64 encoding could prove too large for a news server to support. 
On the other hand, the web infrastructure has a number of participants (Google Groups, for example) which 
are interested in maintaining cached versions of even very large sites. In this case, a single baseline with 
updates could prove to be an acceptable strategy for a repository. 



4.2 Archiving Using SMTP 

One major difference in using email as the replication tool instead of news is that email is passive, not 
active: the email approach relies on existing traffic between the host site and one or more target destination 
sites. Fortunately, the prototype is able to attach files automatically with just a small processing delay 
penalty of less than 1 second. As it turns out though, maintaining a replication list (history pointer) for 
each destination site is critical if a baseline harvest is to be completed. 

Using the variables defined in Table 2, we can develop a general formula to estimate the total number of 
records harvested in D days to a specific destination: 

$$TR_{email} = \sum_{n=1}^{D} Q_{email} \times h(D) \tag{11}$$

$$Q_{email} = \left(\frac{c}{k^{b}}\right) \times G \tag{12}$$

$$0 < h(D) < 1 \Rightarrow \text{no record history pointer maintained} \tag{13}$$

$$h(D) = 1 \Rightarrow \text{record history maintained} \tag{14}$$

If the history list is maintained for every receiver domain, then the pointer value is equal to 1, as indicated
in Equation 14; but if the history pointer is not maintained, then the value varies between 0 and 1 (zero
and one), as shown in Equation 13. The value of h(D) is derived in Equation 21. Unlike news, which is
time oriented, the email approach is destination oriented. Granularity (G) and history-pointer values (h) are 
important factors when calculating the replication estimate. Completing a baseline using email is subject 
to the same constraints as news - repository size, number of records, etc. - but is particularly sensitive 
to changes in email volume. For example, holidays are often used for administrative tasks since they are 
typically "slow" periods, but there is little email generated during holidays so repository replication would 
be slowed rather than accelerated. On the other hand, the large number of unique destination hosts means 
that email is well adapted to repository discovery through advertising. In a single day, information about 
the repository can be disseminated to thousands of potential preservation partners. 



5 Simulation Results 

In addition to an instrumented prototype, we simulated a repository profile similar to some of the largest 
publicly harvestable OAI-PMH repositories. The simulation assumed a 100 gigabyte repository with 100,000 
items (R = 100,000, Rs = 1 MB); a low-end bandwidth of 1.5 megabits per second; an average daily update
rate of 0.4% (Ru = 400); an average daily new-content rate of 0.1% (Ra = 100); and a news-server posting
life (Nttl) of 30 days. For simulating email replication, our estimates were based on the results of our email
experiments: Granularity G = 1, an average of 16866 total outgoing emails per day, and the power-law factor 
applied to the ranks of receiving hosts. We ran the NNTP and SMTP simulations for the equivalent of 2000 
days (5.5 years). 



5.1 Policy Impact on NNTP-Based Archiving 

News-based replication is constrained primarily by network capacity and limits imposed by the receiving news
server. Except for inter-party agreements or some other trans-organizational coordination, the receiver's
policies, even when they are known, are usually not configurable by the sender. A local news server can
influence remote servers by establishing its own Nttl, size limits, content type, etc. A news server may adopt
some of the policies of the source server it is replicating, allowing posts to expire earlier than the local
server's Nttl, but usually not allowing the posts to live longer. Ultimately, the archivist must consider the
balance between the repository's capacity to replicate via NNTP and the news server's ability to support 
replication. 

As Figure 6 illustrated, successful replication depends on Tnews being smaller than Nttl. We can
estimate Tnews using the average record size in the repository (Rs) times the total number of records (R)
and the base64 encapsulation factor (4/3), divided by the net available bandwidth (ν):

$$T_{news} = \frac{R \times R_s \times \frac{4}{3}}{\nu} \tag{15}$$

If the lifetime of a posting is shorter than the baseline harvesting time of the repository (Nttl < Tnews), then
that news server will never hold a complete copy of the repository on any given day.
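The baseline-time estimate and the lifetime check can be sketched in Python. This is an illustration under stated assumptions: the function names are hypothetical, sizes are taken in bytes with 1 MB = 10^6 bytes, and bandwidth is in bits per second.

```python
SECONDS_PER_DAY = 86_400


def baseline_days(records, mean_record_bytes, bandwidth_bps,
                  encoding_factor=4 / 3):
    """Days needed to post every record once, including base64 overhead."""
    bits = records * mean_record_bytes * 8 * encoding_factor
    return bits / bandwidth_bps / SECONDS_PER_DAY


def baseline_fits(records, mean_record_bytes, bandwidth_bps, ttl_days):
    """True when a full baseline completes before the first posts expire."""
    return baseline_days(records, mean_record_bytes, bandwidth_bps) < ttl_days
```

For the simulated repository in this section (100,000 records of 1 MB at 1.5 megabits per second), the sketch gives a baseline of roughly eight days, comfortably inside a 30-day message life.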

Another potential issue is that the sheer size of the repository may make full-content replication to a 
news server impractical because of limits in available processing time or host storage capacity, for example. 
In such a situation the repository could adopt a "By-Reference" archiving policy. This approach is fast and 
efficient, since it stores only the metadata for each repository record rather than the content of the record.
Using Equation 15, we see that a repository with R = 500,000 records and per-record metadata of 1 kilobyte
can be archived in less than 1 day (ignoring updates and additions) at speeds as slow as a dial-up modem
(0.125 Mbps):

$$0.37\ \text{days} = \frac{500{,}000\ \text{records} \times 1\ \text{KB}}{0.125\ \text{Mbps}} \tag{16}$$

For very large and/or very active repositories, this kind of "advertising" may be the optimum solution. 
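The arithmetic behind the 0.37-day figure can be written out step by step. The unit conversions are assumptions made here to reproduce the stated result: 1 KB is taken as 1000 bytes, 0.125 Mbps as 125,000 bits per second, and no base64 factor is applied to the metadata.

$$500{,}000 \times 1000\ \text{bytes} \times 8\ \tfrac{\text{bits}}{\text{byte}} = 4 \times 10^{9}\ \text{bits}$$

$$\frac{4 \times 10^{9}\ \text{bits}}{1.25 \times 10^{5}\ \text{bits/s}} = 3.2 \times 10^{4}\ \text{s}, \qquad \frac{3.2 \times 10^{4}\ \text{s}}{86{,}400\ \text{s/day}} \approx 0.37\ \text{days}$$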

In general, the probability of a given repository record being currently replicated on a specific news server
N on day D is a function of the number of records posted each day to the news server (Qnews), the growth
of the repository (Ra) during those D days, and the lifetime of the record on the server (Nttl):

$$P_{news} = \frac{(Q_{news} \times D) - Q_{news} \times (D - N_{ttl})}{R + (D \times R_a)} \tag{17}$$

Figure 7(b) illustrates how a sufficient grace period (Nttl = 30) can support different repository replication
(sender) policies. In one scenario, continuous baselines are transmitted. New and/or modified records are
queued as they occur. Both the "Cyclic Baseline with Updates" and the "Repeating Baseline" approaches 
eventually result in a steady-state amount of data existing on the news server. This amount is approximately 
equal to the bandwidth available between the repository and the news server, and is a gradually declining 
percentage of the repository as it continues to grow and modify records. 

For the "Cyclic Baseline with Updates" line in Figure 7(b), we simulated a 6-week repeating cycle with
the following sender policy: the entire repository is replicated twice, followed by updates only, then the cycle
is repeated. With this approach, the news server maintains between one and two full copies of the repository,
at least for the first few years. 

The worst replication performance can be seen in the "single baseline with updates" line of Figure 7(b). 
In this third approach, the policy is to make a single baseline copy i.e., a one-time event, which is followed 
by only record updates and new additions. Even these are eventually removed from the news server as 
they reach the limit of Nttl. The result is a rapidly decreasing percentage of repository replication over
time. Eventually, only 30 days' worth of new data exists on the server, since Nttl = 30. Usually, this would
be a very small portion of the repository compared with the other two policies, which can maintain up to
Nttl × Qnews versus Nttl × (Ra + Ru), for example.

It is obvious that as a repository grows and other factors such as bandwidth and news posting time remain 
constant, the news server eventually contains less than 100% of the library's content, even with a policy of 
continuous updates. Nonetheless, a significant portion of the repository remains replicated for many years 
if the news server has a sufficient Nttl.



5.2 Policy Impact on SMTP-Based Archiving 

SMTP-based replication is constrained not only by the frequency of outbound emails, but also by the policies 
adopted by the repository. Consider the following two sender policies: The first policy maintains just one 
queue where items of the repository are being attached to every (1/G)-th email regardless of the receiver domain.
This policy also randomly assigns a record, without maintaining a history pointer of records which have
already been replicated. This is the easiest policy to implement since no history pointers are maintained,
but it will take much longer for a particular domain to receive all records since many duplicate records will
likely be sent while unsent records remain. In the second policy, we have more than one queue where we
keep a pointer for every receiver domain and attach items to every (1/G)-th email going out to these particular
domains. Thus, domain X will receive a new record in each attachment. Duplicates will only begin once 
a baseline to that domain has completed. The second policy allows each receiving domain to converge on 
100% coverage much faster. However, this efficiency comes at the expense of the sending repository tracking 
separate queues for each receiving domain. 
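The contrast between the two sender policies can be sketched in Python. This is a toy illustration, not the simulation used for the results in this section: the names `coverage_without_history` and `HistoryPointer` are hypothetical, and records are represented only by their index.

```python
import random


def coverage_without_history(num_records, num_emails, rng):
    """Policy 1: each email carries a uniformly random record.

    Returns how many distinct records the receiver holds afterwards;
    duplicates are possible, so this is usually below num_emails.
    """
    seen = set()
    for _ in range(num_emails):
        seen.add(rng.randrange(num_records))
    return len(seen)


class HistoryPointer:
    """Policy 2: one pointer per receiver domain, advanced on every send."""

    def __init__(self, num_records):
        self.num_records = num_records
        self.next_index = {}  # domain -> count of records already sent

    def next_record(self, domain):
        sent = self.next_index.get(domain, 0)
        self.next_index[domain] = sent + 1
        # Wrapping around means duplicates start only after a full baseline.
        return sent % self.num_records
```

The cost of the second policy is visible in the `next_index` dictionary: the sender has to keep one entry per receiving domain.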

The impact of email's power law distribution is readily seen when comparing the coverage of higher-frequency
ranks (1 through 5, for example) with lower-frequency ranks. Receiver domains ranked 2 and
3 achieve 100% repository coverage fairly soon, but Rank 20 takes significantly longer (2000 days with a
history pointer), reaching only 60% if no history pointer is maintained. Figure 8(a) shows the time it takes
for a domain to receive all files of a repository without the history pointer and Figure 8(b) shows the same
setup with a history pointer. In both graphs, the 1st-ranked receiver domain is left out because it represents
internal email traffic.

Figure 7: Replicating an extremely active repository

Figure 8: Time to receive 100% repository coverage by email domain rank. (a) Without record history; (b) With record history

Figure 8(a) clearly shows the impact of failing to maintain a record history. Since the statistical
likelihood of selecting a new record from the remaining records decreases as the process progresses, it
becomes less and less likely that a baseline harvest can be reached. Thus, the feasibility of replication via
email, Q_email, is a function of the receiver's rank (k), the granularity (G), and a probability based on the
use of a history pointer (h). Working with the values obtained from our experiments, where b = 1.6 and the
total email volume per day = 16866, and Equation 3, we find that the value of the constant c is 7378.7; this
value can now be used to determine the number of emails sent per day for each receiver domain by rank k:



$$Q_{email} = \frac{7378}{k^{1.6} \times G} \qquad (18)$$



A rank of 3, for example, would mean 1,272 emails per day to that host. The total number of records 
replicated on day D is: 



$$Q_{email} \times h(D) \qquad (19)$$



To give us a good opportunity to complete a baseline, we can set h = 1 and G = 1. In other words,
we maintain a history pointer, and we do not skip any emails. This ensures that we do not send duplicate
records before a baseline of the entire repository has been completed, and that we take full advantage of
email traffic to that domain. Increasing G would shift the graphs of both Figures 8(a) and 8(b) up, and
decreasing it would shift them down. Using these values, we can calculate the probability
that a record has been replicated via email:



$$\frac{Q_{email} \times D}{R + (D \times R_a)} \qquad (20)$$
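
As a worked illustration of Equations 18 and 20, the following sketch computes the daily number of attachments for a given rank and the resulting fraction of the repository replicated after D days. It reads Equation 18 as c / (k^b × G) and Equation 20 as (Q_email × D) / (R + D × R_a); both readings, the cap at 1.0, the function names, and the example repository of 10,000 static records are assumptions made for this illustration.

```python
C = 7378.7   # constant c from the text
B = 1.6      # power-law exponent b from the text

def q_email(rank, granularity=1):
    """Attachments per day to the domain at `rank` (Equation 18)."""
    return C / (rank ** B * granularity)

def replicated_fraction(rank, day, records, added_per_day, granularity=1):
    """Equation 20 with h = 1; capped at 1.0 (the cap is an added assumption)."""
    sent = q_email(rank, granularity) * day
    total = records + day * added_per_day
    return min(1.0, sent / total)

# Rank 3 corresponds to roughly 1,272 emails per day, as stated in the text.
print(round(q_email(3)))

# Hypothetical static repository of 10,000 records, rank 3, G = 1.
for day in (1, 5, 10):
    print(day, round(replicated_fraction(3, day, 10_000, 0), 3))
```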



What if no history pointer is maintained? In that case, we need to include the probability that a new
record is attached to a given email, meaning h(D) is no longer one. The equation for h(D) is a recursive
calculation, since it needs to account for the number of records already sent compared with the number of
remaining, unsent records, i.e., non-duplicates. For simplicity, we assume that no duplicates are sent on the
first day (Equation 22).



$$h(D) = \frac{[R + ((R_u + R_a) \times D)] - Q_{email}}{[R + ((R_u + R_a) \times D)]} \times h(D - 1) \qquad (21)$$

$$h(1) = 1 \qquad (22)$$
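
The recursion in Equations 21 and 22 can be evaluated iteratively. The sketch below assumes Equation 21 reads h(D) = ((N_D - Q_email) / N_D) × h(D - 1), with N_D = R + (R_u + R_a) × D. The function name, the clamp at zero, and the example values are illustrative assumptions rather than part of the reported simulations.

```python
def history_factor(day, records, updated_per_day, added_per_day, q_email):
    """h(D) from Equations 21-22, computed iteratively."""
    h = 1.0                                  # Equation 22: h(1) = 1
    for d in range(2, day + 1):
        pool = records + (updated_per_day + added_per_day) * d
        # Clamp at zero in case q_email exceeds the pool (added safeguard).
        h *= max(0.0, (pool - q_email) / pool)
    return h

# Hypothetical example: 100,000 records, 5 updates and 10 additions per day,
# and about 61 attachments per day (roughly rank 20 with G = 1).
for day in (1, 100, 1000, 2000):
    print(day, round(history_factor(day, 100_000, 5, 10, 61), 3))
```

The factor falls steadily from 1, which is consistent with the observation that a baseline becomes harder to complete the longer the process runs without a history pointer.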



In summary, one can argue that email may not be a practical solution for repository replication, since
the lower-ranked domains will not get a full replication of a good-sized repository in a reasonable time. The
email approach does, however, have a unique advantage: it offers a large number of hosts where the repository
can be advertised.

6 Other Repository Scenarios 

The scenarios we have described so far in this paper involve an unusually active repository, one which 
experiences a high rate of change in the form of new additions and updates to existing records. Our 
hypothetical repository doubles in size in only 1000 days (just under 3 years). We also used a relatively slow 
average network speed (most institutions and even home users will have much higher average bandwidth), 
and further added an average 25% daily network down time. In other words, we stacked the deck against 
the NNTP and SMTP replication methods. Despite these obstacles, the repository continues to be fully 
replicated on the news server for over 2 years. 

Email as a replication tool poses several problems, such as the passive nature of the process (waiting for
emails to be generated) and uncertainty about the persistence of the record on the receiving host. On the
other hand, the large number of domains that receive emails makes this approach very compatible with a
strategy of preservation-by-advertising: the greater the number of sites that are aware of a repository, the
greater the likelihood that the repository will be found by interested users and, perhaps, replicated.

How would these approaches work with other repository scenarios? If the archive were substantially 
smaller (10,000 records with a total size of 15 GB), the time to upload a complete baseline would also be 
proportionately smaller since, as we noted earlier, the replication time is linear with respect to the repository's 
size for both the news and email methods of replication. The news approach actively traverses the repository, 
creating its own news posts, and is therefore constrained primarily by bandwidth to the news server or limits 
on posts imposed by the news server. Email, on the other hand, passively waits for existing email traffic and 
then "hitches a ride" to the destination host. The SMTP approach is dependent on the site's daily email 
traffic to the host, and a reduction in the number of records has a bigger impact if the repository uses the 
email solution because fewer emails will be needed to complete a baseline harvest of the repository. 

6.1 A Mature Repository 

Consider a mature repository with an initial size of 1 million records averaging 100 KB each (totaling 95 GB
of data). If the repository experiences a relatively low level of activity (10 new records (0.001 GB) and 5
modifications (0.0005 GB)), the sender can maintain at least 3 copies of the repository, including changes, for
over 5 years using the NNTP method. As before, we simulate a fairly low bandwidth (10 GB per day maximum
capacity). The column "Mature" in Table 3 lists the repository values and the policy factors for both sender
and receiver.

Figure 9(b) illustrates a simulation using these values sent to a news server with the usual 30-day 
expiration time. The single baseline policy drops off because of the deletion of records from the news server 
every 30 days, but the cyclic and repeating baselines easily keep up with the deletion process throughout 
the 2000-day simulation. As Table 3 notes, the replication target for the repeating baseline is 3 copies, and 
for the cyclic baseline it is 2 copies. 
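
A back-of-the-envelope calculation shows why about 3 copies is a plausible target for the mature repository. The sketch ignores network down time and posting limits, treats the activity figures as daily values, and uses binary units; these are simplifying assumptions, so it approximates the simulation rather than reproducing it.

```python
KB_PER_GB = 1024 ** 2              # binary units, matching the 95 GB figure

records = 1_000_000
record_kb = 100
bandwidth_gb_per_day = 10          # sender's maximum daily upload
expiry_days = 30                   # news server expiration time
daily_change_gb = 0.001 + 0.0005   # additions plus modifications (assumed daily)

repo_gb = records * record_kb / KB_PER_GB
baseline_days = repo_gb / bandwidth_gb_per_day

# Everything posted within the expiry window is still on the server.
retained_gb = expiry_days * bandwidth_gb_per_day
change_gb = expiry_days * daily_change_gb
full_copies = int((retained_gb - change_gb) // repo_gb)

print(f"repository size: {repo_gb:.1f} GB")
print(f"days per baseline: {baseline_days:.1f}")
print(f"full copies within {expiry_days} days: {full_copies}")
```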

Figure 9(a) gives a more detailed look at the first 200 days. Notice that the cyclic baseline requires
a few cycles before it settles down to maintaining about 2 copies on the news server. The peaks occur
because record modifications are replicated as new posts, since previous news messages cannot be modified
directly. The total volume sent to the news server is thus the combined sum of records and the changes
to those records. Results for the same mature repository using the SMTP method are shown in Figure 10.
We can clearly see the impact of maintaining a pointer (Figure 10(b)) versus not tracking the history
(Figure 10(a)).

Table 3: Values used in simulations

6.2 A New, Growing Repository 

The web, of course, is full of new repositories that are fairly active in terms of adding new content and
making routine updates every day. The column labeled "New" in Table 3 lists values for a hypothetical new
repository. It starts out fairly small (only 1000 records (0.1 GB)), but adds new records at a higher rate
than the mature repository (100 records (~10 MB) per day). Modifications to the new repository likewise
happen at a higher rate (20 records (~2 MB) per day) than they do in the mature repository.
Although it would be reasonable to expect the high rate of change to slow over time as the repository
matures, we maintained this high activity level throughout the 2000 days of the simulation.

Figure 11(b) shows the impact of sender policies on maintaining replicated baselines at the news server. 
Despite the high activity rate, both the cyclic baseline and the continuous baseline policies manage to keep 
up with the job of replication for the entire simulation period. Although the news server can no longer 
maintain 3 full copies of the repository with the continuous baseline strategy toward the end of the period, 
the news server retains at least one full copy of the repository for the entire time frame. 

Figure 11(a) gives a closer look at the first 200 days of the simulation. The graph clearly shows the
impact of "sleeping" between the cyclic baselines: during the sleep period, many new records and updates
are created, and records that were replicated earlier reach their N_ttl. This stabilizes eventually, since even
such a low bandwidth can push 10 GB per day to the news server. In other words, the repository can make up
for lost time during the next "awake" cycle. Compare these figures with the performance for the same growing
repository using the SMTP method, as shown in Figure 12. Again, the impact of maintaining a pointer
(Figure 12(b)) versus not tracking the history (Figure 12(a)) is obvious.

6.3 Advertising the Repository 

(a) The First 200 Days (b) Replication During 2000 Days

Figure 9: Replicating a mature repository using NNTP

Figure 10: Replicating a mature repository using SMTP

Figure 11: Replicating a new, growing repository using NNTP

One problem that repositories often face is how to improve their general visibility to other sites and potential
clients. Buried beneath a host of other, competing resources, repositories can become like the Dead Sea
Scrolls, hidden for digital decades. Both the news and the email methods of replication can help solve
this problem using features unique to OAI-PMH: email, by virtue of disseminating information about the
repository to a wide number of hosts; news, thanks to the wide-ranging accessibility of Usenet. The
OAI-PMH "Identify" response could be effectively used to advertise the existence of a repository regardless of
the replication approach or policies. After the repository was discovered, it could be harvested via normal
means. This method can advertise even very large repositories, since only metadata is replicated. A simple
OAI-PMH "Identify" record is very small (a few kilobytes at most) and would successfully publish the
repository's existence in almost zero time regardless of the replication approach that was used.
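
As a minimal sketch of such an advertisement, the code below builds the OAI-PMH "Identify" request URL for a base URL and a small set of headers modeled on the samples in Appendix A. The helper names and the idea of sending exactly these two headers are assumptions for illustration; only the `verb=Identify` request form comes from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

def identify_url(base_url):
    """Return the OAI-PMH Identify request URL for a repository base URL."""
    return f"{base_url}?{urlencode({'verb': 'Identify'})}"

def advertisement_headers(base_url):
    """Hypothetical header set announcing a repository (after Appendix A)."""
    return {
        "X-baseURL": base_url,
        "X-OAI-PMH_verb": "Identify",
    }

base = "http://beatitude.cs.odu.edu:8080/modoai/"
print(identify_url(base))
for name, value in advertisement_headers(base).items():
    print(f"{name}: {value}")
```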

7 Future Work 

Through prototypes and simulation, we have studied the feasibility of replicating repository contents using 
the installed NNTP and SMTP infrastructure. Our initial results are promising and suggest areas for 
future study. In particular, we must explore the trade-off between implementation simplicity and increased 
repository coverage. For the SMTP approach, this could involve the receiving email domains informing the 
sender (via email) that they are receiving and processing attachments. This would allow the sender to adjust 
its policies to favor those sites. For NNTP, we would like to test varying the sending policies over time, as
well as dynamically altering the time between baseline harvests and the transmission of updates and additions.
Furthermore, we plan to revisit the structure of the objects that are transmitted, including taking advantage 
of the evolving research in preparing complex digital objects for preservation [15] [16]. 

8 Conclusions 

It is unlikely that a single, superior method for digital preservation will emerge. Several concurrent, low- 
cost approaches are more likely to increase the chances of preserving content into the future. We believe the 
piggyback methods we have explored here can be either a simple approach to preservation, or a complement to 
existing methods such as LOCKSS, especially for content unencumbered by restrictive intellectual property 
rights. Even if NNTP and SMTP are not used for resource transport, they can be effectively used for 
repository awareness. We have not explored what the receiving sites do with the content once it has been 
received. In most cases, it is presumably unpacked from its NNTP or SMTP representation and ingested 
into a local repository. On the other hand, sites with apparently infinite storage capacity such as Google 
Groups could function as long-term archives for the encoded repository contents.

Appendices



A Including OAI-PMH Headers in Email and News 
A.1 Headers in News Messages

The actual text of the news message is formed and transmitted according to the specification RFC 855 [17].
Here are the headers from an actual message, followed by a snippet of the base64 encoded resource (a JPEG 
in this case): 

Subject: http://beatitude.cs.odu.edu:8080/modoai/10/::1155219621.6635
From: DigLib Mgr <dlmgr@beaufort.cs.odu.edu>
Date: Thu, 10 Aug 2006 14:03:45 +0000 (UTC)
Newsgroups: repository.odu.test1
Path: beatitude.cs.odu.edu!beaufort.cs.odu.edu!not-for-mail
Newsgroups: repository.odu.test1
Organization: ODU DLib
Lines: 382
Message-ID: <ebfecl$rbl$1@beaufort.cs.odu.edu>
NNTP-Posting-Host: ip70-161-100-170.hr.hr.cox.net
X-Trace: beaufort.cs.odu.edu 1155218625 28001 70.161.100.170 (10 Aug 2006 14:03:45 GMT)
X-Complaints-To: news@beaufort.cs.odu.edu
NNTP-Posting-Date: Thu, 10 Aug 2006 14:03:45 +0000 (UTC)
X-Harvest_Time: 2006-8-10T14:20:24Z
X-baseURL: http://beatitude.cs.odu.edu:8080/modoai/10/
X-OAI-PMH_verb: GetRecord
X-OAI-PMH_metadataPrefix: oai_didl
X-OAI-PMH_Identifier: http://beatitude.cs.odu.edu:8080/j_image.jpg
X-sourceURL: http://beatitude.cs.odu.edu:8080/modoai/10/?verb=GetRecord&i
X-sourceURL-1: dentifier=http%3A%2F%2Fbeatitude.cs.odu.edu%3A8080%2Fj_image
X-sourceURL-2: .jpg&metadataPrefix=oai_didl
X-HTTP-Header: HTTP/1.1 200 OK
Xref: beatitude.cs.odu.edu repository.odu.test1:9434

PD94bWwgdmVyc21vbj0iMS4wIiBlbinNvZGluZz0iVVRGLTgiPz4KPE9BSSlQTUggeGlsbnM9Iinh0 
dHA6Ly93d3cub3BlbmFyY2hpdmVzLm9yZy9PQUkvMi4wLyIgeGlsbnM6eHNpPSJodHRw0i8vd3d3 
LnczLni9yZy8yMDAxLlhNTFNj aGVtYSlpbnNOYWbj ZSIgeHNpGnNj aGVtYUxvY2F0aW9uPS JodHRw 
0i8vd3d3Lm9wZW5hcmNoaXZlcy5vcincvT0FJLzIuMC8gaHR0cDovL3d3dy5vcGVuYXJjaG12ZXMu 
b3JnL09BSS8yLjAvT0FJLVBNSC54c2qiPgo8cmVzcG9uc2VEYXRlPjIwMDYtMDgtMTBUMTQ6MTQ6 
MTJaPC9yZXNwb25zZURhdGU+CjxyZXFlZXN0IHZlcinI9IkdldFJlY29yZCIgaWRlbnRpZmllcj0i 
aHR0cDovL2JlYXRpdHVkZS5jcy5vZHUuZWR10jgw0DAval9pbWFnZS5qcGciIGlldGFkYXRhUHJl 
Zml4PSJvYWlfZGlkbCI+aHR0cDovL2JlYXRpdHVkZS5jcy5vZHUuZWR10jgw0DAvbW9kb2FpLzEw 
[...] 

The final section is the base64-encoded resource. We have copied only a few lines of that portion of the 
message (ending with "[...]") since it is very long and not human-readable. 
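
A message with this layout could be assembled along the following lines. This is a sketch using Python's standard `email` and `base64` modules, not the prototype's actual code. The function names, the 60-character chunk width for the continuation headers (X-sourceURL, X-sourceURL-1, ...), and the `example.org` values are assumptions.

```python
import base64
from email.message import Message

def chunk_header(name, value, width=60):
    """Split a long value across name, name-1, name-2, ... headers."""
    pieces = [value[i:i + width] for i in range(0, len(value), width)]
    return [(name if i == 0 else f"{name}-{i}", piece)
            for i, piece in enumerate(pieces)]

def build_news_post(newsgroup, base_url, identifier, source_url, response_xml):
    """Wrap an OAI-PMH GetRecord response (bytes) in a news message."""
    msg = Message()
    msg["Newsgroups"] = newsgroup
    msg["Subject"] = base_url
    msg["X-baseURL"] = base_url
    msg["X-OAI-PMH_verb"] = "GetRecord"
    msg["X-OAI-PMH_metadataPrefix"] = "oai_didl"
    msg["X-OAI-PMH_Identifier"] = identifier
    for name, piece in chunk_header("X-sourceURL", source_url):
        msg[name] = piece
    # The body is the base64-encoded response, as in the sample above.
    msg.set_payload(base64.encodebytes(response_xml).decode("ascii"))
    return msg

# Hypothetical values for illustration only.
url = ("http://example.org/modoai/?verb=GetRecord&identifier="
       + "x" * 100 + "&metadataPrefix=oai_didl")
post = build_news_post("repository.example.test", "http://example.org/modoai/",
                       "http://example.org/x.jpg", url, b"<OAI-PMH/>")
print(post.as_string())
```

A receiving site can recover the original response by base64-decoding the body and can rebuild the source URL by concatenating the X-sourceURL headers in order.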

A.2 Headers in Email Messages

The raw text of an email message with an appended repository record is shown below. 

Date: Tue, 15 Aug 2006 13:09:10 -0400 (EDT)
From: martin klein <mklein@cs.odu.edu>
To: test address on then <iiik@th.endral.seven.research.odu.edu>
Subject: test
Message-ID: <Pine.GSO.4.58.0608151309001.7234@isis.cs.odu.edu>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="737000039-3138878811-3085330315=:7234"
This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.
Send mail to mime@docserver.cac.washington.edu for more info.
X-Harvest_Time: 2006-8-15T17:9:12Z
X-baseURL: http://beatitude.cs.odu.edu:8080/modoai/
X-OAI-PMH_verb: GetRecord
X-OAI-PMH_metadataPrefix: oai_didl
X-OAI-PMH_Identifier: http://beatitude.cs.odu.edu:8080/10/pg10-8.pdf
X-sourceURL: http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&//
identifier=http://beatitude.cs.odu.edu:8080/10/pg10-8.pdf&metadataPrefix=oai_didl
X-HTTP-Header: HTTP/1.1 200 OK
Date: Tue, 15 Aug 2006 17:04:47 GMT
Server: Apache/2.0.49 (Fedora)
Content-Length: 6745
Connection: close
Content-Type: text/xml

--737000039-3138878811-3085330315=:7234
Content-Type: TEXT/PLAIN; charset=US-ASCII

this is a test msg

martin

perl -l -e 'print join "", reverse split //, "!nuf evah"'

--737000039-3138878811-3085330315=:7234
Content-Type: x-application/myxml; charset=US-ASCII; //
name="http://beatitude.cs.odu.edu:8080/10/pg10-8.pdf"
Content-Transfer-Encoding: BASE64
Content-Description: application/xml
Content-Disposition: attachment; filename="e247c91802a684f8fe11ccc4eab74978.xml"
PD94bWwgdmVyc21vbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4K 

PE9BSSlQTUggeGlsbnM9Imh0dHA6Ly93d3cub3BlbmFyY2hpdmVzLm9yZy9PQUkvMi4wLylgeGls 

bnM6eHNpPSJodHRw0i8vd3d3LnczLm9yZy8yMDAxLlhMTFNjaGVtYSlpbnM0YW5jZSIgeHNp0nNj 

aGVtYUxvY2F0aW9uPSJodHRw0i8vd3d3Lm9wZW5hcmNoaXZlcy5vcmcvT0FJLzIuMC8gaHR0cDov 

L3d3dy5vcGVuYXJjaG12ZXMub3JnL09BSS8yLjAvT0FJLVBNSC54c2QiPgo= 

PHJlc3BvbnNlRGF0ZT4yMDA2LTA4LTElVDE30jA00jQ3WjwvcmVzcG9uc2VEYXRlPgo= 

PH Jl cXVl c3QgdmVyY j 0iR2V0UmV j b3 Jkl iBpZGVudGlmaWVyPS JodHRw0i8vYmVhdG10dWRlLmNz 

[...] 

--737000039-3138878811-3085330315=:7234--

As with the news message sample, the base64-encoded portion of the message is only partially shown in this 
email. 
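
The piggybacking step for email can be sketched with Python's standard `email.message.EmailMessage` class. This is an illustration under assumptions, not the prototype's implementation: it uses the standard `application/xml` type instead of the `x-application/myxml` type in the sample, it names the attachment after the MD5 digest of the identifier (a guess at the naming scheme), and the addresses and function name are hypothetical.

```python
import hashlib
from email.message import EmailMessage

def piggyback_record(msg, base_url, identifier, response_xml, harvest_time):
    """Add OAI-PMH X- headers and the record as a base64 attachment."""
    msg["X-Harvest_Time"] = harvest_time
    msg["X-baseURL"] = base_url
    msg["X-OAI-PMH_verb"] = "GetRecord"
    msg["X-OAI-PMH_metadataPrefix"] = "oai_didl"
    msg["X-OAI-PMH_Identifier"] = identifier
    # Hypothetical naming scheme: hex MD5 digest of the identifier.
    filename = hashlib.md5(identifier.encode("utf-8")).hexdigest() + ".xml"
    # Bytes attachments are base64-encoded by default.
    msg.add_attachment(response_xml, maintype="application",
                       subtype="xml", filename=filename)
    return msg

# An ordinary outgoing email (hypothetical addresses) ...
outgoing = EmailMessage()
outgoing["From"] = "sender@example.org"
outgoing["To"] = "receiver@example.net"
outgoing["Subject"] = "test"
outgoing.set_content("this is a test msg\n")

# ... to which a repository record is attached before delivery.
piggyback_record(outgoing, "http://example.org/modoai/",
                 "http://example.org/10/record.pdf", b"<OAI-PMH/>",
                 "2006-08-15T17:09:12Z")
print(outgoing.as_string())
```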
B Email Traffic Data



In Section 3.2 we analyzed the outgoing email traffic of the Computer Science Department at Old Dominion
University over a period of 30 days (January 29th, 2006 through February 27th, 2006). Figures 13(a) and 13(b)
depict the department's outbound email traffic. Note that Figure 13(a) shows a nearly linear relationship
between the cumulative number of new receiver domains (scaled on the left y-axis) and the cumulative
number of emails (the right y-axis) sent within the observed time frame. In Figure 13(b) we can see the
number of different receiver domains per day (left y-axis) compared to the number of emails (right y-axis)
sent per day. In both figures, day 1 represents January 29th and day 30 represents February 27th.

In Figure 13(b), two dramatic decreases in the number of emails sent, as well as in the number of new
receiver domains, are visible. Although we do not have a plausible explanation for the second low point
on Thursday, February 9th, with just 5271 outbound emails, there is a good reason for the first, even more
dramatic low point of just 828 outbound emails on February 5th: it was Super Bowl Sunday. These two
distinctive points are also visible in Figure 13(a), where the cumulative email count barely increases on
those days compared to all other days.

Table 4 shows the top 50 ranked receiver domains. Internal email traffic is dominant, followed by
well-known email providers such as Yahoo! and Gmail. Ignoring internal emails (i.e., odu.edu), only 5
universities appear in the top 50, with the highest-ranking university at rank 33. These points support the
argument that email may be better suited to repository advertisement than to efficient repository replication.

Table 4: Top 50 ranked receiver domains at ODU CS Department email

Rank  Emails  Domain
1     220582  ODU.EDU
2     36508   YAHOO.COM
3     30955   GMAIL.COM
4     14045   COX.NET
5     9960    PRADELLA.BIZ
6     8094    VERIZON.NET
7     3946    COMCAST.NET
8     3478    HOTMAIL.COM
9     3238    POBOX.COM
10    3178    BOUNCE.NITENIGHTPROMO.COM
11    3164    0( 33.COM
12    3009    ACM.ORG
13    2897    BOUNCE.CHARISMADIRINC.COM
14    2702    BOUNCE.BLAYWAY.COM
15    2673    IN iERNA110NALCbPED1110N.COM
16    2617    BOUNCE.DIRECTGAUGEBLUE.COM
17    2555    TAKLAM.COM
18    2289    LARC.NASA.GOV
19    2042    SPEAKEASY.NET
20    1987    SYSABEND.ORG.
21    1983    QUALCOMM.COM
22    1968    GLAVES.ORG
23    1866    BOUNCE.BLUEWATERSKY.COM
24    1838    CW.NET
25    1828    BOUNCE.TICKYTRACKY.COM
26    1804    cable.wanadoo.nl
27    1765    ABSOLUTEMOTION.COM
28    1699    NAXS.NET
29    1643    E-STANDARD.BIZ
30    1642    BOUNCE.DODGEROCKBALL.COM
31    1633    FUSEMAIL.COM
32    1502    JASONONTHE.NET
33    1501    cl.cam.ac.uk
34    1459    COMCONNECTION.NET
35    1441    ABDATOS.COM
36    1423    AUERBACH.COM
37    1418    BOUNCE.SKYBEACHTIE.COM
38    1394    CHRISTENSENARMS.COM
39    1358    NCSI.IISC.ERNET.IN
40            PWTT PriTT
41    1304    BILLINGHAM-SL.COM
42    1216    BARR-MULLIN.COM
43    1211    EXODUS.NET
44    1175    IGETSMART.COM
45    1134    maths.anu.edu.au
46    1122    BOUNCE.TUNETIMELAP.COM
47    1098    virtua.com.br
48    950     NSWC.NAVY.MIL
49    938     kolache.cs.tamu.edu
50    936     limited-onlineoffers.com